Phylogenomic branch length estimation using quartets

Abstract Motivation Branch lengths and topology of a species tree are essential in most downstream analyses, including estimation of diversification dates, characterization of selection, understanding adaptation, and comparative genomics. Modern phylogenomic analyses often use methods that account for the heterogeneity of evolutionary histories across the genome due to processes such as incomplete lineage sorting. However, these methods typically do not generate branch lengths in units that are usable by downstream applications, forcing phylogenomic analyses to resort to alternative shortcuts such as estimating branch lengths by concatenating gene alignments into a supermatrix. Yet, concatenation and other available approaches for estimating branch lengths fail to address heterogeneity across the genome. Results In this article, we derive expected values of gene tree branch lengths in substitution units under an extension of the multispecies coalescent (MSC) model that allows substitutions with varying rates across the species tree. We present CASTLES, a new technique for estimating branch lengths on the species tree from estimated gene trees that uses these expected values, and our study shows that CASTLES improves on the most accurate prior methods with respect to both speed and accuracy. Availability and implementation CASTLES is available at https://github.com/ytabatabaee/CASTLES.


S1 Proofs for Branch Length Equations
Preliminaries.
Under the MSC, waiting times before coalescent events are exponential random variables with rate λ = k 2 , where k is the number of lineages entering an interval (Kingman, 1982). Therefore, the probability density function for the coalescence event between k lineages in an interval with length x is f X (x) = λe −λx = k 2 e −( k 2 )x ; i.e., e −x for two lineages, 3e −3x for three lineages, and 6e −6x for four lineages. The mean of this random variable is 1 which is 1 and 1 3 for two and three lineages respectively. Tavaré (1984) derived an equation for the function g ij (T ) as the probability that i lineages coalesce into j lineages in time T . Specific cases of this function (shown in Figures S1 to S4) are tabulated by Rosenberg (2002) and can be used to double-check our derivations.
Notations. As shown in Figures S1 and S2, we assume an unbalanced model species tree (((A, B) : T 1 , C) : T 2 , D) and a gene tree on the same leafset with an unrooted topology that either matches or does not match the topology of the species tree. In Figures S3 and S4, we work with the balanced species tree ((A, B) : T 1 , (C, D) : T 2 ). The parameters of the model species tree (i.e. µ i s and T i s) are defined according to Figure 1 of the main paper. We denote the expected length of the internal branch in a gene tree with unrooted topology ψ (where ψ is either ab|cd, ac|bd or ad|bc) by E(l I (ψ)), and the expected length of a terminal branch leading to taxa X by E(l X (ψ)). Figures S2 and S4 only show scenarios for non-matching gene trees that have the topology ad|bc, as the derivations for the other non-matching topology (ac|bd) is similar, and in all cases except for cherry branches, the expected lengths of a branch in these two topologies are the same.
We first compute the average expected length of the internal and terminal branches in a quartet gene tree for gene trees matching the topology of the species tree (denoted by L I for the internal branch, L X for the terminal branch leading to taxa X) as well as gene trees not matching the species tree (denoted by L ′ I for internal branch, L ′ X for terminal branch leading to taxa X) for both unbalanced and balanced model species trees.
To calculate these expected lengths, we consider all scenarios that lead to different branch lengths in matching or non-matching gene trees (summarized in Figures S1 to S4 for the internal branch) and their corresponding probabilities, and compute the following conditional expectation for the internal branch (similar expectations can be written for terminal branches, modifying l I to l X ) E(l I (ψ)) = E(l I |Ψ = ψ) = xf l I |Ψ (x|ψ)dx = 1 P(ψ) xf l I ,Ψ (x, ψ)dx where Ψ is a random variable denoting the unrooted gene tree topology, and P(ψ) is the probability of the specific topology ψ under the MSC, which can be computed as follows (Allman et al., 2011)  In some figures, one shape corresponds to more than one scenario (specified in the caption); we use this only when both scenarios give the same expected internal branch length and are derived by swapping two lineages. All the calculations (in particular, integrals) in the proofs are verified using Mathematica, and the notebook is available at https://github.com/ytabatabaee/CASTLES/blob/main/su branch calcs.nb.

S1.1 Unbalanced species trees (Theorem 1 and related lemmas)
Before proving Theorem 1, we introduce and prove four lemmas: All of our results and their proofs are in reference to Figures S1 and S2.
Lemma S1 (Internal unbalanced). For the unbalanced model species tree of Figure 1, the expected length of the internal branch of a gene tree with an unrooted topology matching the species tree in substitution units is L I = E(l I (ab|cd)) = (e −3T2 + 3e −T2 − 6e T1−T2 )(µ 2 − µ 3 ) + 6(1 − e T1 + T 1 e T1 )µ 1 2(3e T1 − 2) + µ 2 (S1) and the expected length for gene trees not matching the species tree topology is L ′ I = E(l I (ac|bd)) = E(l I (ad|bc)) = µ 2 + Lemma S2 (Terminal A or B (cherries), unbalanced). For the unbalanced model species tree of Figure 1, the expected length of the terminal edge A (equivalently, B) of a gene tree with an unrooted topology matching the species tree in substitution units is and the expected lengths for gene trees not matching the species tree topology are E(l A (ad|bc)) = and therefore L ′ A = 1 2 (E(l A (ad|bc)) + E(l A (ac|bd))) = 1 12 (10µ 2 − 9e −T2 (µ 2 − µ 3 ) − 3 −3T2 (µ 2 + µ 3 )) + T 1 µ 1 + T A µ A (S4) Lemma S3 (Terminal unbalanced C). For the unbalanced species tree, the expected length of the terminal edge C of a gene tree with an unrooted topology matching the species tree in substitution units is L C = E(l C (ab|cd)) = − e −T2 (µ 2 − µ 3 ) + µ 2 + µ C T C + 2µ 2 − 3e −T2 − e −3T2 (µ 2 − µ 3 ) − 4µ 3 e −3T2 6(3e T1 − 2) (S5) and the expected length for gene trees not matching the species tree topology is Lemma S4 (Terminal unbalanced D). For the unbalanced species tree, the expected length of the terminal edge D of a gene tree with an unrooted topology matching the species tree in substitution units is T 2 , D). L denotes the internal branch length in the gene tree and P denotes the probability of each case. Case (b) and (d) correspond to two scenarios and case (e) corresponds to four different scenarios, and the P values reported in these cases show the overall probability of all possible scenarios for that case.
Proof of Lemma S1. Referring to Figure S1, we can compute E(L I (ab|cd)) = T1 0 T2 0 e −x e −y ((T 1 − x)µ 1 + yµ 2 ) dy dx (scenario (a)) Similarly, referring to Figure S2, we can compute Proof of Lemma S2. Referring to Figure S1, we can compute Referring to Figure S2, we can compute and Proof of Lemma S3. Referring to Figure S1, we can compute And referring to Figure S2, we can compute Proof of Lemma S4. Referring to Figure S1, we can compute Referring to Figure S2, we can compute With these lemmas, we can now prove the main results.
Proof of Theorem 1. Subtracting Eq. S10 from Eq. S9, we get The average length of the terminal branch of A in a gene tree not matching the topology of the species tree is therefore the average of Eq. S12 and Eq. S13 Subtracting Eq. S19 from Eq. S11, we get ∆A = E(LA(ab|cd)) − 1 2 (E(LA(ad|bc)) + E(LA(ac|bd))) (S20) Subtracting Eq. S15 from Eq. S14, we get Subtracting Eq. S17 from Eq. S16, we get

S1.2 Balanced Species Tree
Similar to Theorem 1 for unbalanced trees, we now provide Theorem S1 for balanced trees.
Theorem S1 (Balanced). For the balanced model species tree of Figure 1, let ∆ I be the difference in the expected internal branch length in substitution units of gene trees with an unrooted topology matching the species tree and those not matching the species tree. Then, Similarly, let ∆ A be the difference in the expected length of matching and non-matching gene trees for the terminal branch leading to a cherry A.
Note that since all taxa in a balanced quartet tree are part of a cherry, B, C and D would follow similar equations as (S25) by substituting the appropriate µ and T values, following the symmetry of the tree. We first introduce and prove two lemmas, and then follow up with the proof of Theorem S1. All of our results and their proofs are in reference to Figures S3 and S4.
Lemma S5 (Internal balanced). For the unbalanced model species tree of Figure 1, the expected length of the internal branch of a gene tree with an unrooted topology matching the species tree in substitution units is ) and the expected length for gene trees not matching the species tree topology is Lemma S6 (Terminal (cherries), balanced). For the unbalanced model species tree of Figure 1, the expected length of the terminal edge A (equivalently, B) of a gene tree with an unrooted topology matching the species tree in substitution units is and the expected lengths for gene trees not matching the species tree topology are ) ) ) Figure S3: Scenarios for gene tree matching the balanced species tree. Internal branch lengths for unrooted quartet gene tree matching the balanced model species tree ((A, B) : T 1 , (C, D) : T 2 ). Here, L denotes the internal branch length in the gene tree and P denotes the probability of each case. Case (c) and (e) correspond to two scenarios and case (g) corresponds to four different scenarios, and the P values reported in these cases show the overall probability of all possible scenarios for that case. T 2 ). L denotes the internal branch length in the gene tree and P denotes the probability of each case. Case (b) corresponds to four different scenarios.
Proof of Lemma S5. Referring to Figure S3, we can compute For a balanced model tree, there are only two scenarios for a gene tree not matching the species tree, shown in Figure S4. The expected length of the internal branch in this case is Proof of Lemma S6. Referring to Figure S3, we can compute Similarly, referring to Figure S4, we can compute Using these lemmas, we now prove Theorem S1.
To compute the length of the internal branch, i.e. t 1 + t 2 , we simplify Equation (S24) by computing its limit as T 2 → 0 (note that by symmetry, T 1 → 0 can also be used and will lead to the same final formula).
which is exactly the same as Equation (5) for unbalanced trees. Note that by eq. (S27), we haveL ′ I = µ 3 , therefore, further assuming µ 1 = µ 3 allows us to estimate µ 1 as the average length of the internal branch in non-matching gene trees. We replace ∆ I with the observed difference between average internal branch lengths in matching and non-matching gene trees in Equation (S38), leading to the same equation as eq. (6) for unbalanced trees. The solution to this equation is based on the Lambert W function (see main text), which we approximate using a Taylor approximation as in Equation (7), leading to the following formula:T . Note that since we initially assumed T 2 → 0, the length of the internal branch in the limit only depends onT 1μ1 , and we use Equation (S39) as an estimate of t 1 + t 2 (the same estimator can be derived using T 1 → 0). For the terminal branch of A, assuming µ 3 → µ 1 , we can simplify Equations (S25) and (S29), as follows: We replace the expected value lim µ3→µ1 L ′ A in Equation (S41) with the observed mean difference∆ A and lim µ3→µ1 L ′ A with the observed meanL ′ A . Solving for µ A T A gives the following estimate for t A : Note that due to symmetry, all nodes in a balanced quartet are part of a cherry, and therefore the same equations and simplifications can be used for all terminal branches, only replacing the appropriate T s and µs. In particular, for C and D, we use a different assumption µ 3 → µ 2 , so that the final equation depends on T 2 and µ 2 , i.e. parameteres of the branch above the cherry. Tables S1 to S3 summarize the expected lengths, simplified formulas and the final branch length estimators for both unbalanced and balanced trees.  Figure S5: Root branch calculation based on balanced or unbalanced quartets. The root branch in the unrooted species tree corresponds to two branches in the rooted tree (the highlighted branches). Our equations calculate the sum of the two root-adjacent branches, that correspond to a single branch in the unrooted species tree. Hence, the length of the individual parts of this branch, and therefore the exact position of the root, is not inferred in our approach. a) When none of the children of the root is a leaf, then the length of the root branch can be computed from balanced quartets, using the internal branch equation. b) When one child of the root is a leaf, the root branch length can be calculated using the equations for the root-adjacent terminal branch (referred to as terminal D in the proofs) from unbalanced quartets. Table S1: Summary of formulas for expected branch lengths in matching and non-matching gene trees.

Parameter
Simplified formula Simplifying assumption Table S3: Summary of formulas for estimating unbalanced or balanced species tree branch lengths in SU. Note that for an unbalanced species tree of Figure 1, the internal branch has length t 1 in SU, but for the balanced species tree it has length t 1 + t 2 , as it is the root branch. For each branch X,L X andL ′ X refer to the observed mean branch length in matching or non-matching gene trees. For both balanced and unbalanced internal branch,δ is defined asδ =∆ Ī L ′ I . The length of the internal branch does not depend on CU lengths, but the equations for the terminal branches use the CU length of the adjacent internal branch (i.e. T 1 for unbalanced tree and T 1 + T 2 for balanced tree) which is computed using the approach from Sayyari and Mirarab (2016) inside CASTLES.

Parameter
Estimation formula Simplifying assumption(s)

S2 Dynamic Programming Algorithm
In this section, we will provide an O(n 2 ) large tree algorithm for computing branch lengths for all internal and terminal branches. For conciseness, we use A, B, C, D to denote sets of taxa and use a, b, c, d to denote individual taxa. To compute all branch lengths, it is sufficient to compute the following counters for a set of ordered leafset quadripartitions, in which each leafset quadripartition (A, B, C, D) -up to permutations -corresponds to an internal branch: • n(A, B; C, D): the number of quartet and gene tree combinations • a(A; B; C, D): the total length of the terminal branches leading to A in quartet trees in the form All three counters for each quadripartition (A, B, C, D) can be computed in a single post-order traversal of the gene tree nodes in O(n) using Algorithm S1. Therefore, computing counters for all O(n) quadripartitions has time complexity O(n 2 ). Notice that Algorithm S1 assumes that all input gene trees are fully resolved. Otherwise, input gene trees should be arbitrarily resolved by adding ghost branches of zero lengths (not counted towards n(A, B; C, D)).
Algorithm S1 Large tree algorithm. The input is a set of gene trees G and an ordered quadripartition of its leafset (A, B, C, D), and the output are n(A, B; C, D), x(A, B; C, D), and a(A; B; C, D). For each node u we keep a list of counters C · · (u) described in Table S4.
if u corresponds to a taxon in A then 4: else if u corresponds to a taxon in B then 7: else if u corresponds to a taxon in C then 10: Update all counters C · · (u) using the recursive formula in Table S4 25: end for 31: end procedure Table S4: We define several counters for every node w, each of which computes e∈S(w) f (e) for some S and f . We define several notations: let p denote the parent node of w and u, v denote the child nodes of w; let L(w) denote the set of leaves under w; let D(·, ·) denote the distance of two nodes; let M(·, ·) denote the most recent common ancestor of two nodes; let H(·, ·) be the 0/1 indicator of whether all branches on the path between the two nodes are ghost branches. Superscripts − and + signify ghost branches and all branches, respectively. Note that A and B are interchangeable in the names; e.g., Here, we omit the column S(w). For all C · AB|CD (w) show below, S(w) is the set of quartets a, b, c, d, whose MRCA is w, form a tree with topology ab|cd and are elements of sets A, B, C, D, respectively.

Counter
Function f (e) Recursive formula Similarly defined Illustra�on of Recursive Formula Figure S6: Illustration of counters and their recursive formulas. Branches colored red are counted by lengths, and dotted branches must be ghost branches.

S3.1 Quartet Simulations
We used a modified version of SimPhy (Mallo et al., 2016), available at https://github.com/ytabatabaee/ CASTLES/tree/main/simulation files, that outputs species trees with substitution units branch lengths. We simulated quartet species trees and true gene trees under six different model conditions using the following commands:

S3.2 Software Commands and Version Numbers
In this section, we bring the details of the experimental pipeline and software commands. The code for CASTLES as well as the scripts for error and distance matrix calculation use functions from DendroPy (Sukumaran and Holder, 2010). All experiments were run on the University of Illinois campus cluster.

S3.2.1 Branch Length Estimation
We used the following commands to estimate branch lengths in substitution units on a given species tree topology: • CASTLES: Running CASTLES is a two-step approach: 1) Annotate branches of the species tree with mean quartet branch lengths around it using the tool ASTER, and 2) assign final branch lengths to each branch using the castles.py (v1.0.0) code available at github.com/ytabatabaee/ CASTLES/blob/main/castles.py.
Step 1) The algorithm to annotate branches with mean lengths of quartets around it is implemented in ASTER (v1.13.2.4) available at https://github.com/chaoszhang/ASTER. To annotate a fixed species tree topology with quartet statistics for each branch, we ran it with the option -C to score a fixed species tree specified after -c. Note that ASTER needs to be complied with a particular option for this annotation to work: g ++ -std = gnu ++11 -D " ASTRALIV " -march = native -Ofast -pthread src / astral . cpp -o bin / astral Assuming this version is used, we used the following command, where the annotated tree is printed to the log file: Particularly, when we have multiple individuals per species and the individual names do not match the species names, we run the following command: where the "name map" file contains maps from individual names to species names in the following format: Step 2) The annotated tree is given to CASTLES to compute branch lengths in substitution units from these quartet statistics. To run CASTLES, we used the following command (note that the log file from ASTER is directly passed to CASTLES as input): python3 castles . py -t annotated . tre -g < gene_tree_path > -o < output_path > • FastME+Mean or FastME+Min: Running FastME is also a two-step approach: 1) estimating the distance matrix, 2) inferring branch lengths.
Step 1) We used our custom script to compute patristic (path-length) distance matrices from gene trees in Phylip format. The core of this script is the PhylogeneticDistanceMatrix class from Dendropy. The script is available at https://github.com/ytabatabaee/CASTLES/blob/main/ scripts/patristic dist matrix.py. The option -m specifies the type of distance matrix: 'avg' and 'min' compute a single squared matrix corresponding to the average and minimum path-length distances between pairs of nodes in a set of gene trees, respectively.
Step 2) We used FastME (v2.1.6.2) (Lefort et al., 2015) available at http://www.atgc-montpellier.fr/ fastme/ to assign branch lengths with balanced minimum evolution (BME) criteria (specified with -w BalLS) to a given species tree topology, specified with -u, given a single distance matrix corresponding to average or minimum patristic distances computed across a set of gene trees, using the following command: • ERaBLE: Similar to FastME, ERaBLE requires pre-computation of distances. We used the same script as FastME but with option 'all' to compute one patristic distance matrix per gene, for a set of gene trees. We used the following command: We then used ERaBLE (v1.0) (Binet et al., 2016) available at http://www.atgc-montpellier.fr/erable/. The input to ERaBLE is an unrooted tree topology in newick format, specified with the option -t, and a set of k distance matrices in Phylip format, each corresponding to a single gene tree, generated above. We used the following command: • RAxML: We used RAxML (v8.2.12) (Stamatakis, 2014) to estimate branch lengths on a given species tree topology, using a concatenated sequence alignment, with the option -f e. RAxML is available at https://github.com/stamatak/standard-RAxML. We used the following command: raxmlHPC -PTHREADS -f e -t < species_tree_path > -m GTRGAMMA -s < alignment_path > -n RES -p 4321 -T 16

S3.2.2 Error Calculation
We used the script available at https://github.com/ytabatabaee/CASTLES/blob/main/scripts/ compare trees bl.py to compare branch lengths on two trees, and compute the root mean squared error (RMSE) and the average logarithmic error between the trees for all branches. We used the following command: python3 compare_trees_bl . py -t1 < true_species_tree_path > -t2 < est_species_tree_path > which tabulates the set of true and estimated branch lengths. These data are then analyzed using an R script available at https://github.com/ytabatabaee/CASTLES/blob/main/results/draw.R Note that for all methods, before computing the error, negative and zero branch lengths are replaced with 1e-6. The formulas for the error metrics used in this study for a species tree T with b branches are as follows, where t i andt i are the true and estimated lengths of branch i in SU respectively:

S3.2.3 Runtime and Memory
We measure runtime as the total running time of all steps of each method, assuming that the estimated gene trees, alignments, and species tree topology are already available. This includes the time needed to calculate the distance matrices for ERaBLE and FastME, as well as the time needed for annotating trees with ASTER that is given as input to CASTLES. To measure the runtime and peak memory usage of a command <cmd>, we use the following command: We record runtime from the elapsed wall clock time, and peak memory usage from maximum resident set size from the generated out.stat file.

S3.3 Biological Data Analysis
We analyzed the mammalian biological dataset of Song et al. (2012), which had 37 species (36 ingroups and one outgroup). The original dataset had 447 genes, but we used a processed version of the dataset with 424 genes available here, that had 23 mislabeled or outlier genes removed. We estimated an unrooted species tree using ASTRAL (v5.7.8) (Zhang et al., 2018) available at github.com/smirarab/ASTRAL using the following command.
java -jar astral .5.7.8. jar -i < gene -trees . tre > -o < species -tree . tre > We then removed the outgroup (Chicken, specified with the name "GAL" in the dataset) from the ASTRAL tree, as our simulations show that branch length estimation methods benefit from the removal of outgroup, and then estimated branch lengths on the tree. All files associated with this analysis are available at https://github.com/ytabatabaee/CASTLES/tree/main/results/mammalian-biologicalanalysis.  Figure S9: Mean absolute error on terminal and internal branches for simulated 101-taxon datasets. The plot shows mean and standard error across 50 replicates, in addition to boxplots. The y-axes is cut at 0.07, eliminating a few outlier cases with unusually high errors (none from CASTLES). Figure 3.a in the main text shows the same error aggregated for terminal and internal branches.  Figure S10: Mean absolute error (top) and mean log error (bottom) on terminal and internal branches for simulated MVRoot datasets. The plot shows mean and standard error across 100 replicates, in addition to boxplots. The y-axis is cut at 0.11, leaving a few outliers out of the graph (one from CASTLES). Patristic(MIN)+FastME has mean log error above 2 (see Fig. S18) and is excluded. Figure 4 in the main text shows the same errors aggregated for terminal and internal branches.  Figure S13: Correlations between true and estimated branch lengths on the S100 datasets. Each dot corresponds to a single branch in an unrooted 101-taxon species tree, and the results are shown across 50 replicates (therefore 50 × 199 points in each subfigure). The lines show a fitted degree-two polynomial with smoothing.  Figure S14: RMSE (top) and mean log error (bottom) of branch lengths estimated using different methods on simulated S100 datasets. Both panels show mean and standard deviation across 50 replicates, in addition to boxplots. The average GTEE level varies between zero (for true gene trees) to 23% (for 1600bp) and then to 55% (for the 200bp sequences). The average ILS level is 0.46 AD, and the number of genes is 1000.   Figure S17: Mean log error (in base 10) of branch lengths estimated using different methods (excluding Patristic(MIN)+FastME) versus ILS measured using AD (mean RF distance of true gene trees to the model species tree) on the MVRoot dataset in the no-outgroup model conditions. The number of genes is 500 and the results are shown across 100 replicates.  Figure S18: Mean absolute error (top) and mean log error (bottom) of branch lengths estimated using different methods on MVRoot datasets. Focusing on cases without outgroups, we divide replicates based on their level of true gene tree discordance due to ILS into four groups. The number of genes is 500 and the results are shown across 100 replicates. The main paper excluded Patristic(MIN)+FastME due to very high error.  Figure S19: (a) Visualization of the true species tree (middle), and trees with branch lengths produced by CASTLES (bottom) and Concat+RAxML (top) for replicate 1 of MVRoot dataset in the default model condition (no outgroup, medium deviation from the strict clock). The ILS level and GTEE level for this replicate are 0.53 and 0.34 respectively. (b) Same methods applied on the real mammalian dataset (Song et al., 2012). Here, the outgroup (Chicken) is removed before drawing branch lengths on the tree. The trees are visualized using FigTree (Rambaut, 2023). Branch Type terminal internal Figure S20: Correlations between branch lengths produced by CASTLES and Concat+RAxML on the 37-taxon mammalian dataset (Song et al., 2012). The branch lengths are drawn on a species tree topology constructed by ASTRAL. The outgroup (Chicken) is removed before drawing branch lengths on the tree. The number of genes used in the analysis is 424. Each point corresponds to a single branch in the species tree. The colored lines show a fitted degree-two polynomial.